55 research outputs found
Reducing branch delay to zero in pipelined processors
A mechanism to reduce the cost of branches in pipelined processors is described and evaluated. It is based on the use of multiple prefetch, early computation of the target address, delayed branch, and parallel execution of branches. The implementation of this mechanism using a branch target instruction memory is described. An analytical model of the performance of this implementation makes it possible to measure the efficiency of the mechanism with a very low computational cost. The model is used to determine the size of cache lines that maximizes the processor performance, to compare the performance of the mechanism with that of other schemes, and to analyze the performance of the mechanism with two alternative cache organizations.Peer ReviewedPostprint (published version
Evaluating A+B=K conditions in constant time
The authors consider a type of condition that can be evaluated without requiring a complete ALU (arithmetic logic unit) operation. The circuit that is presented detects the condition A+B=K (n-bit numbers) in constant time, avoiding the carry propagation delay. This circuit can be used to detect a wide spectrum of conditions in branch instructions. It can improve the processor performance by advancing the evaluation of conditions and eliminating the pipeline delays produced by these operations.Peer ReviewedPostprint (published version
Computing size-independent matrix problems on systolic array processors
A methodology to transform dense to band matrices is presented in this paper. This transformation, is accomplished by triangular blocks partitioning, and allows the implementation of solutions to problems with any given size, by means of contraflow systolic arrays, originally proposed by H.T. Kung. Matrix-vector and matrix-matrix multiplications are the operations considered here.The proposed transformations allow the optimal utilization of processing elements (PEs) of the systolic array when dense matrix are operated. Every computation is made inside the array by using adequate feedback. The feedback delay time depends only on the systolic array size.Peer ReviewedPostprint (published version
Vectorized register tiling
In the last years, there has been much effort in commercial compilers (icc, gcc) to exploit efficiently the SIMD capabilities and the memory hierarchy that the current processors offer. However, the small numbers of compilers that can automatically exploit these characteristics achieve in most cases unsatisfactory results. Therefore, the programmers often need to apply by hand the optimizations to the source code, write manually the code in assembly or use compiler built-in functions (such intrinsics) to achieve high performance. In this work, we present source-to-source transformations that help commercial compilers exploiting the memory hierarchy and generating efficient SIMD code. Results obtained on our experiments show that our solutions achieve as excellent performance as hand-optimized vendor-supplied numerical libraries (written in assembly).Peer ReviewedPreprin
Conflict-free strides for vectors in matched memories
Address transformation schemes, such as skewing and linear transformations, have been proposed to achieve conflict-free access to one family of strides in vector processors with matched memories. The paper extends these schemes to achieve this conflict-free access for several families. The basic idea is to perform an out-of-order access to vectors of fixed length, equal to that of the vector registers of the processor. The hardware required is similar to that for the access in order.Peer ReviewedPostprint (author's final draft
Filtering directory lookups in CMPs
Coherence protocols consume an important fraction of power to determine which coherence action should take place. In this paper we focus on CMPs with a shared cache
and a directory-based coherence protocol implemented as a duplicate of local caches tags. We observe that a big fraction of directory lookups produce a miss since the block looked up is not cached in any local cache. We propose to add a filter before the directory lookup in order to reduce the number of lookups to this structure. The filter identifies whether the current block was last accessed as a data or as an instruction. With this information, looking up the whole directory can be avoided for most accesses. We evaluate the filter in a CMP with 8 in-order processors with 4 threads each and a memory hierarchy with a shared L2 cache.We show that a filter with a size of 3% of the tag array of the shared cache can avoid more than 70% of all comparisons performed by directory lookups with a performance loss of just 0.2% for SPLASH2 and 1.5% for Specweb2005. On average,
the number of 15-bit comparisons avoided per cycle is 54 out of 77 for SPLASH2 and 29 out of 41 for Specweb2005. In both cases, the filter requires less than one read of 1 bit per cycle.Postprint (published version
Analysis and simulation of multiplexed single-bus networks with and without buffering
Performance issues of a single-bus interconnection network for multiprocessor systems, operating in a multiplexed way, are presented in this paper. Several models are developed and used
to allow system performance evaluation. Comparisons with equivalent crossbar systems are provided. It is shown how crossbar EBW values can be reached and exceeded when appropriate operation parameters are chosen in a multiplexed
single-bus system. Another architectural feature is considered, concerning the utilization of buffers at the memory modules. With the buffering scheme, memory interference can be reduced so that the system performance is practically improved.Peer ReviewedPostprint (published version
Systematic design of two level pipelined systolic arrays with data contraflow
Many systolic algorithms and related design methodologies
have been recently proposed. Frecuently, in these systolic
algorithms practical considerations are not taken into account.
Equitatively distributed load between processing elements,
pipelined functional units etc, are desirable features when
implementing systolic algorithms.In this paper we present a
design methodology in which these features are considered. As
an example, the methodology is applied to obtain a
problem-size-independent, two-level pipelined 1D systolic
algorithm with data contraflow to efficiently solve triangular
systems of equations.Peer ReviewedPostprint (published version
El optimizador de bucles del compilador Open64/ORC (parte 2)
Open64 y ORC (Open Research Compiler) son dos iniciativas de código abierto basadas en el compilador SGI Pro64. Open64 está gestionada por miembros de la Universidad de Delaware, y ORC es una extensión del compilador desarrollada por Intel y la Chinese Academy of Science. Para más información consultar las respectivas páginas web [2] y [1].
SGI Pro64 es un conjunto de compiladores optimizadores desarrollados por SGI. Incluye compiladores de C, C++ y Fortran90/95 que siguen los estándares ABI y API de Linux IA-64. Los archivos fuente son de dominio público y se distribuyen bajo los términos de la GNU General Public License. El conjunto
de compiladores está disponible para correr sobre plataformas Linux IA-32 e IA-64.
Este documento continúa el trabajo iniciado en los technical reports “Introducción al compilador Open64/ORC” [10] y “El optimizador de bucles del compilador Open64/ORC (parte 1)” [11]. El primero describe los componentes del compilador y la representación intermedia que se utiliza como interficie común entre ellos. El segundo documento se centra específicamente en uno de los componentes del compilador: el optimizador de bucles.Postprint (published version
Source-to-Source transformations for efficient SIMD code generation
In the last years, there has been much effort in commercial compilers to generate efficient SIMD instructions-based code sequences from conventional sequential programs. However, the small numbers of compilers that can automatically use these
instructions achieve in most cases unsatisfactory results. Therefore, the code often has to be written manually in assembly language or using compiler built-in functions to achieve high performance. In this work, we present source-to-source transformations that help commercial vectorizing compilers to generate efficient SIMD code. Experimental results show that excellent performance can be achieved. In particular, for the problem of matrix product (SGEMM) we almost achieve as high performance as hand-optimized numerical libraries. Our source-tosource transformations are based on the scalar replacement and unroll and jam transformations presented by Callahan et all. In particular, we extend the use of scalar replacement to vectorial replacement and combine this transformation with unroll and jam and outer loop vectorization to fully exploit the vector register level and thus to help the compiler to generate efficient SIMD code. We will show experimentally the effectiveness of our proposal.Peer ReviewedPostprint (published version
- …